Capturing Semantic Hierarchies to Perform Meaningful Integration in HTML Tables

Authors

  • Shijun Li
  • Mengchi Liu
  • Guoren Wang
  • Zhiyong Peng
Abstract

We present a new approach that automatically captures the semantic hierarchies in HTML tables and semi-automatically integrates HTML tables belonging to a domain. It first automatically captures the attribute-value pairs in HTML tables by normalizing the tables and recognizing their headings. After a global schema is created manually, it learns lexical semantic sets and contexts, which it then uses to eliminate conflicts and resolve nondeterministic cases when mapping each source schema to the global schema, thereby integrating the data in the HTML tables.
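The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the cell encoding `(text, rowspan, colspan)`, the function names, and the hand-written `lexical_sets` dictionary are all assumptions made for illustration (the paper learns its lexical semantic sets rather than hard-coding them). The sketch normalizes a table by expanding spanned cells, builds attribute-value pairs from stacked headings, and renames source attributes to global ones.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical cell encoding: (text, rowspan, colspan).
Cell = Tuple[str, int, int]


def normalize(rows: List[List[Cell]]) -> List[List[str]]:
    """Expand rowspan/colspan so every grid position holds a cell's text."""
    grid: Dict[Tuple[int, int], str] = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already filled by a cell spanning down from above.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def attribute_value_pairs(grid: List[List[str]], heading_rows: int = 1) -> List[Dict[str, str]]:
    """Pair each data cell with the heading path above its column."""
    headings = grid[:heading_rows]
    records = []
    for row in grid[heading_rows:]:
        record = {}
        for c, value in enumerate(row):
            path: List[str] = []
            for h in headings:
                # Drop repeats that come from a heading spanning several rows.
                if h[c] and (not path or path[-1] != h[c]):
                    path.append(h[c])
            record[".".join(path)] = value
        records.append(record)
    return records


def map_to_global(record: Dict[str, str], lexical_sets: Dict[str, Set[str]]) -> Dict[str, str]:
    """Rename source attributes to global ones; unmatched attributes are dropped."""
    mapped = {}
    for attr, value in record.items():
        for global_attr, synonyms in lexical_sets.items():
            if attr.lower() in synonyms:
                mapped[global_attr] = value
                break
    return mapped


# A two-level heading: "Price" spans the "USD" and "EUR" columns.
rows = [
    [("Model", 2, 1), ("Price", 1, 2)],
    [("USD", 1, 1), ("EUR", 1, 1)],
    [("X1", 1, 1), ("100", 1, 1), ("90", 1, 1)],
]
grid = normalize(rows)
records = attribute_value_pairs(grid, heading_rows=2)
# Hand-written stand-in for the learned lexical semantic sets.
lexical_sets = {"product": {"model", "item"}, "price_usd": {"price.usd", "cost.usd"}}
print(map_to_global(records[0], lexical_sets))
```

Here the number of heading rows is passed in by hand; recognizing headings automatically, as the abstract describes, would replace that parameter.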

Similar resources

A Semantic Approach to Internet Tabular Information Extraction

Extracting information from tables is essential for Internet information extraction. However, most web tables are designed in HTML format. To decipher their semantic meanings a system needs to deal with various layouts, which is quite cumbersome. Previous works have two major approaches: layout enumeration approach and wrapper approach. The first approach is to match the table with presorted la...

WInte.r - A Web Data Integration Framework

The Web provides a plethora of structured data, such as semantic annotations in web pages, data from HTML tables, datasets from open data portals, or linked data from the Linked Open Data Cloud. For many use cases, it is necessary to integrate such web data with existing local datasets. This integration entails schema matching, identity resolution, as well as data fusion. As an alternative to u...

Semantic Web Rules for Business Information

A description of the New Brunswick Business Knowledge Base (NBBizKB) is provided and is made available online in RuleML. NBBizKB realizes a two-step design. First, business facts are extracted, once from static CSV tables and repeatedly from dynamic semi-structured HTML pages. Second, Semantic Web rules are developed to derive information implicit in the fact base. Fact extraction comprises an...

Understanding Tables on the Web

The Web contains a wealth of information, and a key challenge is to make this information machine processable. Because natural language understanding at web scale remains difficult and costly at present, in this paper we focus our attention on understanding well-structured HTML tables on the Web. From 0.3 billion Web documents, we obtain 1.95 billion tables, and 0.5-1% of these contain meaning...

Extracting Knowledge Bases from table-structured Web Resources applied to the semantic based Requirements Engineering Methodology SoftWiki

A lot of information on the Web is provided as HTML-formatted tables and CSV files. Such tables contain semantic information that can be derived from the embedded environment of the table as well as from the heading of each column. The problem of integrating and linking this information into semantic web applications often arises. One way to solve this is a transformation of these tables into OWL ...

Publication date: 2004